Project 3 - Movie Review Sentiment Analysis¶

STAT 542 Statistical Learning - Fall 2022¶

In [1]:
library("IRdisplay")
display_png(file="imdb.png", width = 1000)
Warning message:
"package 'IRdisplay' was built under R version 4.2.3"

1. Introduction¶

  • Sentiment analysis is a tool for classifying people's impressions of a product or topic.
  • Using sentiment analysis, an algorithm can read text written in natural language and map it to a scale ranging from positive to negative feeling.
  • Because it can interpret human language, sentiment analysis is widely used on many online platforms.
  • For instance, companies use sentiment analysis to gather direct customer feedback about a product.

2. Overview of the project¶

  • In this project, we create a sentiment analysis model to interpret movie reviews on the IMDB website.
  • The data consists of 50,000 reviews with scores from 0 to 10.
  • If the score is 4 or below, the review is classified as negative; if the score is 7 or above, it is classified as positive (reviews with scores of 5 and 6 are excluded from this analysis).
  • The goal of the algorithm is to predict whether a review is negative or positive directly from the text written by the reviewer.
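This labeling rule can be sketched in Python (a hypothetical helper mirroring how the sentiment labels are derived from the scores, not code from the project itself):

```python
def label(score):
    """Map a 0-10 IMDB score to a sentiment label; scores of 5-6 are excluded."""
    if score <= 4:
        return 0      # negative review
    if score >= 7:
        return 1      # positive review
    return None       # 5 or 6: dropped from the dataset
```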

3. Data processing¶

The data set, located in the file "alldata.tsv", is a table with 50,000 rows and 4 columns. Each row represents a review, and the columns are:

  1. Id, the identification number of each review.

  2. Sentiment, 0 for negative and 1 for positive.

  3. Score, from 0 to 10, excluding 5 and 6.

  4. Review, the actual text written by the reviewer.

The data processing consists of 2 main steps: (1) removing irrelevant symbols and (2) filtering the text using a vocabulary.
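As a rough illustration of these two steps, here is a minimal Python sketch (the actual cleaning in this project is done with text2vec in R below; the regexes here are illustrative assumptions, not the exact filters used):

```python
import re

def clean_review(text, stop_words):
    """Lowercase, strip HTML tags and punctuation, then drop stop words."""
    text = re.sub(r"<[^>]+>", " ", text.lower())   # remove tags such as <br />
    text = re.sub(r"[^a-z0-9\s]", " ", text)       # remove punctuation and symbols
    return [w for w in text.split() if w not in stop_words]

tokens = clean_review("It is like a train wreck: so awful!", {"it", "is", "a"})
```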

4. Model¶

We use both R and Python to build our model. R is first used to identify the meaningful terms in the reviews. The vocabulary of extracted terms is then passed to Python to build the final model.

The detailed steps to build and test our classification model are:

  1. Clean the data by removing punctuation marks and stop words.

  2. From the cleaned reviews, build a vocabulary of all N-grams that appear, with N ranging from 1 to 4 (individual words up to sequences of 4 words).

  3. Vectorize the reviews using Count Vectorization of the N-grams.

  4. Reduce the vocabulary to roughly 1000 terms using Logistic Regression with Lasso regularization (main step, done in R in this notebook).

  5. Re-vectorize all reviews using the reduced vocabulary of N-grams found by Lasso.

  6. Train a Neural Network on the review vectors (main step, done in Python in the next notebook).

  7. Evaluate the AUC of the Neural Network model on a test set.
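As an illustration of steps 2 and 3, here is a minimal pure-Python sketch of N-gram count vectorization (the notebook itself uses text2vec's `create_vocabulary` and `create_dtm`; the `_` joiner mirrors text2vec's default N-gram separator):

```python
from collections import Counter

def ngrams(tokens, n_max=4):
    """All contiguous 1- to n_max-grams, joined with '_'."""
    return ["_".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def count_vectorize(tokens, vocabulary):
    """Count of each vocabulary term in one review: the review's feature vector."""
    counts = Counter(ngrams(tokens))
    return [counts[term] for term in vocabulary]

vec = count_vectorize(["very", "good", "movie"], ["good", "very_good", "bad"])
```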

In the next few sections, we walk through all the computations done in this project.


Part I. Vocabulary Reduction using R¶

In [2]:
library("text2vec")
library("glmnet")
Loading required package: Matrix

Loaded glmnet 4.1-4

Generating the vocabulary¶

First, we load the data and take a look at it:

In [3]:
train = read.table("alldata.tsv",
                   stringsAsFactors = FALSE,
                   header = TRUE)

head(train,n = 2L)
A data.frame: 2 × 4

id    sentiment  score  review
<int> <int>      <int>  <chr>
1     1          10     Naturally in a film who's main themes are of mortality, nostalgia, and loss of innocence it is perhaps not surprising that it is rated more highly by older viewers than younger ones. However there is a craftsmanship and completeness to the film which anyone can enjoy. The pace is steady and constant, the characters full and engaging, the relationships and interactions natural showing that you do not need floods of tears to show emotion, screams to show fear, shouting to show dispute or violence to show anger. Naturally Joyce's short story lends the film a ready made structure as perfect as a polished diamond, but the small changes Huston makes such as the inclusion of the poem fit in neatly. It is truly a masterpiece of tact, subtlety and overwhelming beauty.
2     0          2      This movie is a disaster within a disaster film. It is full of great action scenes, which are only meaningful if you throw away all sense of reality. Let's see, word to the wise, lava burns you; steam burns you. You can't stand next to lava. Diverting a minor lava flow is difficult, let alone a significant one. Scares me to think that some might actually believe what they saw in this movie.<br /><br />Even worse is the significant amount of talent that went into making this film. I mean the acting is actually very good. The effects are above average. Hard to believe somebody read the scripts for this and allowed all this talent to be wasted. I guess my suggestion would be that if this movie is about to start on TV ... look away! It is like a train wreck: it is so awful that once you know what is coming, you just have to watch. Look away and spend your time on more meaningful content.

Next, tokenize the reviews, remove the stop words listed below, and build a vocabulary of 1- to 4-grams, pruned of very rare and very common terms:

In [4]:
stop_words = c("i", "me", "my", "myself", 
               "we", "our", "ours", "ourselves", 
               "you", "your", "yours", 
               "their", "they", "his", "her", 
               "she", "he", "a", "an", "and",
               "is", "was", "are", "were", 
               "him", "himself", "has", "have", 
               "it", "its", "the", "us")

# Tokenize the lowercased reviews
it_train = itoken(train$review,
                  preprocessor = tolower, 
                  tokenizer = word_tokenizer)

# Build a vocabulary of 1- to 4-grams, excluding the stop words above
tmp.vocab = create_vocabulary(it_train, 
                              stopwords = stop_words, 
                              ngram = c(1L,4L))

# Prune rare terms (fewer than 10 occurrences, or in less than 0.1% of
# reviews) and very common terms (in more than half of the reviews)
tmp.vocab = prune_vocabulary(tmp.vocab, term_count_min = 10,
                             doc_proportion_max = 0.5,
                             doc_proportion_min = 0.001)

# Build the document-term matrix
dtm_train = create_dtm(it_train, vocab_vectorizer(tmp.vocab))
as(<dgTMatrix>, "dgCMatrix") is deprecated since Matrix 1.5-0; do as(., "CsparseMatrix") instead

Now, fit a Logistic Regression with Lasso regularization to reduce the number of terms in the vocabulary. The vector tmpfit$df below gives the number of nonzero coefficients (i.e., the vocabulary size) at each point of the regularization path:

In [5]:
set.seed(3213)
tmpfit = glmnet(x = dtm_train, 
                y = train$sentiment, 
                alpha = 1,
                family='binomial')
tmpfit$df
  1. 0
  2. 1
  3. 2
  4. 3
  5. 4
  6. 4
  7. 6
  8. 7
  9. 11
  10. 15
  11. 18
  12. 22
  13. 25
  14. 39
  15. 48
  16. 57
  17. 67
  18. 83
  19. 97
  20. 114
  21. 131
  22. 153
  23. 174
  24. 206
  25. 238
  26. 270
  27. 303
  28. 338
  29. 389
  30. 437
  31. 489
  32. 561
  33. 643
  34. 740
  35. 859
  36. 982
  37. 1126
  38. 1281
  39. 1471
  40. 1711
  41. 1962
  42. 2262
  43. 2585
  44. 2934
  45. 3273
  46. 3667
  47. 4087
  48. 4497
  49. 4890
  50. 5321
  51. 5735
  52. 6158
  53. 6587
  54. 7018
  55. 7400
  56. 7747
  57. 8086
  58. 8447
  59. 8778
  60. 9054
  61. 9376
  62. 9646
  63. 9886
  64. 10117
  65. 10354
  66. 10574
  67. 10760
  68. 10983
  69. 11150
  70. 11286
  71. 11421
  72. 11529
  73. 11666
  74. 11776
  75. 11860
  76. 11965
  77. 12053
  78. 12142
  79. 12232
  80. 12352
  81. 12419
  82. 12495
  83. 12542
  84. 12594
  85. 12631
  86. 12663
  87. 12705
  88. 12738
  89. 12776
  90. 12914
  91. 12882
  92. 12994
  93. 13038
  94. 13076
  95. 13115
  96. 13141
  97. 13190
  98. 13212
  99. 13253
  100. 13270

Then, take the largest vocabulary size that is less than 2000 (position 41, df = 1962):

In [6]:
myvocab = colnames(dtm_train)[which(tmpfit$beta[, 41] != 0)]

Now, use this smaller vocabulary to run another Lasso step:

In [7]:
it_train = itoken(train$review,
                  preprocessor = tolower, 
                  tokenizer = word_tokenizer)

vectorizer = vocab_vectorizer(create_vocabulary(myvocab, 
                                                ngram = c(1L, 2L)))
dtm_train = create_dtm(it_train, vectorizer)


set.seed(3213)
tmpfit = glmnet(x = dtm_train, 
                y = train$sentiment, 
                alpha = 1,
                family='binomial')
tmpfit$df
  1. 0
  2. 1
  3. 2
  4. 3
  5. 4
  6. 4
  7. 6
  8. 7
  9. 11
  10. 14
  11. 17
  12. 21
  13. 24
  14. 39
  15. 47
  16. 55
  17. 66
  18. 83
  19. 97
  20. 111
  21. 127
  22. 149
  23. 169
  24. 197
  25. 225
  26. 261
  27. 287
  28. 324
  29. 367
  30. 412
  31. 454
  32. 527
  33. 589
  34. 675
  35. 774
  36. 875
  37. 1005
  38. 1131
  39. 1287
  40. 1452
  41. 1572
  42. 1630
  43. 1659
  44. 1672
  45. 1682
  46. 1689
  47. 1693
  48. 1698
  49. 1699
  50. 1701
  51. 1703
  52. 1706
  53. 1709
  54. 1711
  55. 1715
  56. 1720
  57. 1722
  58. 1728
  59. 1731
  60. 1733
  61. 1736
  62. 1740
  63. 1740
  64. 1743
  65. 1743
  66. 1745
  67. 1746
  68. 1750
  69. 1750
  70. 1751
  71. 1754
  72. 1757
  73. 1758
  74. 1760
  75. 1763
  76. 1765
  77. 1767
  78. 1768
  79. 1769
  80. 1770
  81. 1770
  82. 1773
  83. 1773
  84. 1773
  85. 1776
  86. 1776
  87. 1777
  88. 1778
  89. 1778
  90. 1780
  91. 1782
  92. 1783
  93. 1783

Finally, we take the fit whose df is closest to 1000 (position 37, df = 1005):

In [8]:
myvocab2 = colnames(dtm_train)[which(tmpfit$beta[, 37] != 0)]

Then, export the terms in this vocabulary to a CSV file:

In [9]:
write.csv(myvocab2, "myvocab2.csv")
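The next notebook reads this CSV in Python and carries out steps 5-7. As a reference for step 7, here is a minimal pure-Python sketch of the AUC metric (the probability that a random positive review is scored above a random negative one; in practice one would use a library routine such as scikit-learn's `roc_auc_score`):

```python
def auc(labels, scores):
    """ROC AUC via the Mann-Whitney U statistic (assumes no tied scores)."""
    pairs = sorted(zip(scores, labels))            # sort reviews by model score
    # Sum the ranks of the positive reviews within the sorted list
    rank_sum = sum(rank for rank, (_, y) in enumerate(pairs, start=1) if y == 1)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

score = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```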